I used the clean dataset produced in the “Colorado_Fulldataset” document in order to analyze only the general public response on Twitter, since this dataset already excluded tweets sent by official agencies and bots. The total number of tweets in the dataset is 3858.
Only tweets sent during the Flood and Immediate Aftermath phases of the disaster were kept, which excluded 45% of the tweets. That is, we will use 2132 of the original 3858 tweets.
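The phase filter can be sketched with dplyr (a sketch only: the data frame name `tweets_clean` and the `phase` column are assumptions, not the actual script):

```r
library(dplyr)

# Keep only tweets from the two phases of interest
# (object and column names are assumptions)
tweets_flood <- tweets_clean %>%
  filter(phase %in% c("Flood", "Immediate Aftermath"))

# Should retain 2132 of the original 3858 tweets
nrow(tweets_flood)
```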
Before any spatial analysis or plotting, the data were projected to the North America Lambert Conformal Conic coordinate reference system:
## Coordinate Reference System:
## No EPSG code
## proj4string: "+proj=lcc +lat_1=20 +lat_2=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
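The projection step might look like this with sf (a sketch: the object name `tweets_flood` is an assumption, and the proj4 string is copied from the output above):

```r
library(sf)

# Lambert Conformal Conic definition, matching the proj4string printed above
lcc <- paste(
  "+proj=lcc +lat_1=20 +lat_2=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0",
  "+ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
)

# Reproject the point geometries (input object name is an assumption)
tweets_p <- st_transform(tweets_flood, crs = lcc)
```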
One of the goals of this study is to confirm whether spatially clustered tweets can serve as a proxy for reports from affected areas. So I will repeat the same analysis done for the whole dataset, but now considering only tweets belonging to clusters after the spatial clustering process. A hierarchical implementation of DBSCAN (HDBSCAN) was used for the spatial clustering. For HDBSCAN we need to choose the minimum number of points required to form a cluster. With any value between 159 and 225, clusters in Colorado were identified. So we picked 160, the value in that range that retains the maximum number of tweets.
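The clustering step can be sketched with the `dbscan` package's `hdbscan()` (a sketch: `tweets_p` and `coords` are assumed names for the projected points and their coordinate matrix):

```r
library(dbscan)
library(sf)

# Coordinate matrix in meters from the LCC-projected points
# (object name is an assumption)
coords <- st_coordinates(tweets_p)

# minPts = 160: the value in the 159-225 range retaining the most tweets
cl <- hdbscan(coords, minPts = 160)

# hdbscan labels noise points as cluster 0; keep only clustered tweets
tweets_clustered <- tweets_p[cl$cluster != 0, ]
```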
After applying the spatial filter to the tweets sent during the flood phases, only 30% of the total tweets were retained: 1172 of the 3858.
A quick view of the most common words in the whole dataset:
## # A tibble: 3,070 x 2
## word n
## <chr> <int>
## 1 boulder 734
## 2 boulderflood 355
## 3 cowx 158
## 4 flood 115
## 5 coflood 113
## 6 colorado 105
## 7 rain 92
## 8 flooding 79
## 9 creek 76
## 10 amp 61
## # … with 3,060 more rows
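Counts like the table above are typically produced with tidytext (a sketch: the text column name `text`, the data frame name, and the stop-word removal are assumptions):

```r
library(dplyr)
library(tidytext)

word_counts <- tweets_flood %>%
  unnest_tokens(word, text) %>%            # one row per word per tweet
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)                 # frequency table, most common first
```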
Again, since “boulder” is the most common word and is going to have a big effect on our topic modelling, it was removed from the dataset. The term “boulderflood” was also excluded because it was very common and used neutrally in all four stages. After excluding the two terms, the new list of common words looks as follows:
## # A tibble: 3,068 x 2
## word n
## <chr> <int>
## 1 cowx 158
## 2 flood 115
## 3 coflood 113
## 4 colorado 105
## 5 rain 92
## 6 flooding 79
## 7 creek 76
## 8 amp 61
## 9 rt 55
## 10 water 52
## # … with 3,058 more rows
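Dropping the two dominant terms before recounting might look like this (a sketch, reusing the assumed `word_counts` name):

```r
library(dplyr)

# Exclude the two terms that dominate the corpus
word_counts_trimmed <- word_counts %>%
  filter(!word %in% c("boulder", "boulderflood"))
```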
Again, after experimenting with different numbers of topics, I decided to train a topic model with 15 topics. From 16 topics onward, topics started to look very similar (sharing the same bag of words). Here is a summary of the results after this process:
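A 15-topic model can be fit with `topicmodels::LDA()` on a document-term matrix (a sketch: the `tweet_word_counts` data frame and its `tweet_id`/`word`/`n` columns are assumptions, as is the seed):

```r
library(tidytext)
library(topicmodels)

# Build a document-term matrix from per-tweet word counts
# (input data frame and column names are assumptions)
dtm <- tweet_word_counts %>%
  cast_dtm(document = tweet_id, term = word, value = n)

# 15 topics; from 16 on, topics began to share the same bag of words
lda_15 <- LDA(dtm, k = 15, control = list(seed = 1234))

# Per-topic word probabilities, for inspecting the top terms of each topic
top_terms <- tidy(lda_15, matrix = "beta")
```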
Mapping topic 12 to see its spatial distribution:
mapview(tweet_and_topic_geo, zcol = "topic", layer.name = "topic", burst = TRUE) +
mapview(affected_counties_p)